Objective: to work with machine learning algorithms in R to classify and predict spatial variables.
For this exercise we will use a Landsat 8 image from Chile and a shapefile with polygons of different classes for training purposes. Let’s load the terra, e1071, and randomForest packages.
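Assuming the band GeoTIFFs and the training shapefile sit in the working directory (file names taken from the printed output below), the setup might look like this sketch:

```r
library(terra)
library(e1071)
library(randomForest)

# Stack the seven surface-reflectance bands into one SpatRaster
covariates <- rast(list.files(pattern = "_sr_band[1-7]\\.tif$"))

# Read the training polygons
training <- vect("training_sites_LCC.shp")
```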
Let’s print and plot the covariates object.
print(covariates)
## class : SpatRaster
## dimensions : 893, 1032, 7 (nrow, ncol, nlyr)
## resolution : 30, 30 (x, y)
## extent : 274095, 305055, -4274535, -4247745 (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 19N (EPSG:32619)
## sources : LC08_L1TP_232087_20200408_20200422_01_T1_sr_band1.tif
## LC08_L1TP_232087_20200408_20200422_01_T1_sr_band2.tif
## LC08_L1TP_232087_20200408_20200422_01_T1_sr_band3.tif
## ... and 4 more source(s)
## names : band ~tance, band ~tance, band ~tance, band ~tance, band ~tance, band ~tance, ...
## min values : -304, -293, -44, -66, 34, -13, ...
## max values  : 6741, 7354, 8243, 8969, 7858, 5895, ...

Now we can load the shapefile that will be used for training purposes.
## class : SpatVector
## geometry : polygons
## dimensions : 29, 2 (geometries, attributes)
## extent : 274172.4, 305050.1, -4274117, -4248031 (xmin, xmax, ymin, ymax)
## source : training_sites_LCC.shp
## coord. ref. : lon/lat WGS 84 (EPSG:4326)
## names : id Class
## type : <int> <int>
## values : 2 900
## 3 900
## 4 900
Let’s visualise the attribute table of the training shapefile.
data.frame(training)
## id Class
## 1 2 900
## 2 3 900
## 3 4 900
## 4 1 900
## 5 5 100
## 6 6 100
## 7 7 100
## 8 8 400
## 9 9 400
## 10 10 400
## 11 11 400
## 12 12 400
## 13 13 1000
## 14 14 300
## 15 15 300
## 16 16 300
## 17 17 300
## 18 18 200
## 19 19 200
## 20 20 200
## 21 21 200
## 22 22 600
## 23 23 600
## 24 24 500
## 25 25 500
## 26 26 600
## 27 27 800
## 28 28 800
## 29 29  800

What are the classes?
| Class | Description |
|---|---|
| 100 | crops |
| 200 | forest |
| 300 | grassland |
| 400 | shrubland |
| 500 | wetland |
| 600 | water |
| 800 | urban |
| 900 | barren land |
| 1000 | ice_snow |
Besides the different Landsat bands, we can calculate the Normalised Difference Vegetation Index (NDVI) and the Normalised Difference Water Index (NDWI) so we can use them in the classification as well.
\[ NDVI = (NIR - red) / (NIR + red) \]
\[ NDWI = (green - NIR) / (green + NIR) \]
| Bands | Wavelength (micrometers) | Resolution (m) |
|---|---|---|
| Band 1 - Coastal aerosol | 0.43-0.45 | 30 |
| Band 2 - Blue | 0.45-0.51 | 30 |
| Band 3 - Green | 0.53-0.59 | 30 |
| Band 4 - Red | 0.64-0.67 | 30 |
| Band 5 - Near Infrared (NIR) | 0.85-0.88 | 30 |
| Band 6 - SWIR 1 | 1.57-1.65 | 30 |
| Band 7 - SWIR 2 | 2.11-2.29 | 30 |
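Following the band table above (band 3 = green, band 4 = red, band 5 = NIR), a sketch of the index calculation could be:

```r
green <- covariates[[3]]
red   <- covariates[[4]]
nir   <- covariates[[5]]

# Normalised difference indices from the formulas above
ndvi <- (nir - red) / (nir + red)
ndwi <- (green - nir) / (green + nir)
names(ndvi) <- "ndvi"
names(ndwi) <- "ndwi"
```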
Once we have calculated the NDVI and NDWI layers, we can add them to the covariates using the add function.
add(covariates) <- c(ndvi, ndwi)
names(covariates)
## [1] "band 1 surface reflectance" "band 2 surface reflectance"
## [3] "band 3 surface reflectance" "band 4 surface reflectance"
## [5] "band 5 surface reflectance" "band 6 surface reflectance"
## [7] "band 7 surface reflectance" "ndvi"
## [9] "ndwi"

To train the RF and SVM models we have to create a data.frame object that contains the variable that we want to predict and its respective covariates. For this purpose, we will use the extract function from the terra package.
training_df <- terra::extract(covariates, training)
head(training_df)
## ID band 1 surface reflectance band 2 surface reflectance
## 1 1 415 554
## 2 1 571 710
## 3 1 392 496
## 4 1 253 318
## 5 1 384 517
## 6 1 386 491
## band 3 surface reflectance band 4 surface reflectance
## 1 822 1009
## 2 974 1168
## 3 717 903
## 4 496 646
## 5 799 1057
## 6 797 959
## band 5 surface reflectance band 6 surface reflectance
## 1 1131 1293
## 2 1286 1397
## 3 949 1090
## 4 681 767
## 5 1186 1293
## 6 1055 1259
## band 7 surface reflectance ndvi ndwi
## 1 994 0.05700935 -0.1582181
## 2 1096 0.04808476 -0.1380531
## 3 879 0.02483801 -0.1392557
## 4 629 0.02637528 -0.1571793
## 5 1028 0.05751226 -0.1949622
## 6 969 0.04766634 -0.1393089
summary(training_df)
## ID band 1 surface reflectance band 2 surface reflectance
## Min. : 1.00 Min. : 2.0 Min. : 17.0
## 1st Qu.: 4.00 1st Qu.: 185.0 1st Qu.: 226.0
## Median : 9.00 Median : 261.0 Median : 312.0
## Mean : 9.48 Mean : 596.9 Mean : 685.3
## 3rd Qu.:14.00 3rd Qu.: 408.0 3rd Qu.: 485.0
## Max. :29.00 Max. :5784.0 Max. :6488.0
## band 3 surface reflectance band 4 surface reflectance
## Min. : 41 Min. : -1.0
## 1st Qu.: 322 1st Qu.: 327.0
## Median : 428 Median : 499.0
## Mean : 861 Mean : 942.1
## 3rd Qu.: 678 3rd Qu.: 792.0
## Max. :7417 Max. :8011.0
## band 5 surface reflectance band 6 surface reflectance
## Min. : 93 Min. : 50.0
## 1st Qu.: 572 1st Qu.: 453.0
## Median :1095 Median : 659.0
## Mean :1580 Mean : 897.2
## 3rd Qu.:2112 3rd Qu.:1199.0
## Max. :7232 Max. :4366.0
## band 7 surface reflectance ndvi ndwi
## Min. : 13.0 Min. :-0.36949 Min. :-0.8667
## 1st Qu.: 311.8 1st Qu.: 0.03926 1st Qu.:-0.6267
## Median : 609.0 Median : 0.08043 Median :-0.1940
## Mean : 669.8 Mean : 0.28911 Mean :-0.3304
## 3rd Qu.: 870.0 3rd Qu.: 0.60382 3rd Qu.:-0.1278
## Max.   :4308.0             Max.   : 1.00948   Max.   : 0.6117

As we have learned, data is crucial for applying machine learning algorithms. In general terms, the sample should representatively describe the complexity of the problem that you are trying to solve. In other words, you should provide enough training data to capture the relationships between dependent and independent variables. Some points to consider:
Now, we have to replace the IDs of the polygons from the training shapefile with the class values of the different land cover classes.
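A minimal sketch of this replacement, using the extracted ID column as an index into the attribute table of the training polygons:

```r
# The ID column holds the polygon index (1-29); look up its Class code
classes <- data.frame(training)
training_df$ID <- classes$Class[training_df$ID]
```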
For simplicity, let’s change the names of the predicted variable and the covariates.
names(training_df) <- c("class", paste0("band", 1:7), "ndvi", "ndwi")
str(training_df)
## 'data.frame': 8348 obs. of 10 variables:
## $ class: num 900 900 900 900 900 900 900 900 900 900 ...
## $ band1: num 415 571 392 253 384 386 639 417 388 244 ...
## $ band2: num 554 710 496 318 517 491 787 471 445 317 ...
## $ band3: num 822 974 717 496 799 ...
## $ band4: num 1009 1168 903 646 1057 ...
## $ band5: num 1131 1286 949 681 1186 ...
## $ band6: num 1293 1397 1090 767 1293 ...
## $ band7: num 994 1096 879 629 1028 ...
## $ ndvi : num 0.057 0.0481 0.0248 0.0264 0.0575 ...
## $ ndwi : num -0.158 -0.138 -0.139 -0.157 -0.195 ...

As observed, the class variable is numeric. To work with classification rather than regression, we have to convert this variable to a factor.
training_df$class <- as.factor(training_df$class)
str(training_df)
## 'data.frame': 8348 obs. of 10 variables:
## $ class: Factor w/ 9 levels "100","200","300",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ band1: num 415 571 392 253 384 386 639 417 388 244 ...
## $ band2: num 554 710 496 318 517 491 787 471 445 317 ...
## $ band3: num 822 974 717 496 799 ...
## $ band4: num 1009 1168 903 646 1057 ...
## $ band5: num 1131 1286 949 681 1186 ...
## $ band6: num 1293 1397 1090 767 1293 ...
## $ band7: num 994 1096 879 629 1028 ...
## $ ndvi : num 0.057 0.0481 0.0248 0.0264 0.0575 ...
## $ ndwi : num -0.158 -0.138 -0.139 -0.157 -0.195 ...

Finally, we can train the models. We will evaluate RF and the four different kernels for SVM. This procedure may take a while.
rf_model <- randomForest(class ~ ., data = training_df)
svm_model_lin <- svm(class ~ ., data = training_df, kernel = "linear")
svm_model_poly <- svm(class ~ ., data = training_df, kernel = "polynomial")
svm_model_rad <- svm(class ~ ., data = training_df, kernel = "radial")
svm_model_sig <- svm(class ~ ., data = training_df, kernel = "sigmoid")

We can print the RF model to assess the classification:
print(rf_model)
##
## Call:
## randomForest(formula = class ~ ., data = training_df)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 1.27%
## Confusion matrix:
## 100 200 300 400 500 600 800 900 1000 class.error
## 100 47 0 18 0 0 0 0 0 0 0.2769230769
## 200 0 1308 0 14 0 0 0 0 0 0.0105900151
## 300 6 0 669 8 0 0 3 0 0 0.0247813411
## 400 0 19 12 1157 0 0 0 0 0 0.0260942761
## 500 0 0 0 1 6 0 0 2 0 0.3333333333
## 600 0 0 1 1 1 193 2 9 0 0.0676328502
## 800 0 0 3 0 0 3 66 0 0 0.0833333333
## 900 0 0 0 0 0 1 1 3992 0 0.0005007511
## 1000   0    0   0    0   0   0  0    1  804 0.0012422360

Now, we can predict the land cover classes over the Landsat image (this will take even more time).
# Replacing the names
names(covariates) <- c(paste0("band", 1:7), "ndvi", "ndwi")
rf_LC <- predict(covariates, rf_model)
svm_LC_lin <- predict(covariates, svm_model_lin)
svm_LC_poly <- predict(covariates, svm_model_poly)
svm_LC_rad <- predict(covariates, svm_model_rad)
svm_LC_sig <- predict(covariates, svm_model_sig)

To evaluate the performance of the machine learning algorithms we will use the verification_sites shapefile.
## class : SpatVector
## geometry : polygons
## dimensions : 27, 2 (geometries, attributes)
## extent : 274178.3, 304992.4, -4274494, -4249372 (xmin, xmax, ymin, ymax)
## source : verification_sites_LCC.shp
## coord. ref. : WGS 84 / UTM zone 19N (EPSG:32619)
## names : id Class
## type : <int> <int>
## values : 1 100
## 2 100
## 3 100
Let’s visualise the attribute table of the verification shapefile.
data.frame(verification)
## id Class
## 1 1 100
## 2 2 100
## 3 3 100
## 4 4 200
## 5 5 200
## 6 6 200
## 7 7 300
## 8 8 300
## 9 9 300
## 10 10 400
## 11 11 400
## 12 12 400
## 13 13 500
## 14 14 500
## 15 15 500
## 16 16 600
## 17 17 600
## 18 18 600
## 19 19 800
## 20 20 800
## 21 21 800
## 22 22 900
## 23 23 900
## 24 24 900
## 25 25 1000
## 26 26 1000
## 27 27 1000

Let’s stack the different models.
results <- rf_LC
add(results) <- c(svm_LC_lin, svm_LC_poly, svm_LC_rad, svm_LC_sig)
print(results)
## class : SpatRaster
## dimensions : 893, 1032, 5 (nrow, ncol, nlyr)
## resolution : 30, 30 (x, y)
## extent : 274095, 305055, -4274535, -4247745 (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 19N (EPSG:32619)
## sources : memory
## memory
## memory
## ... and 2 more source(s)
## names : lyr1, lyr1, lyr1, lyr1, lyr1
## min values : 1, 1, 1, 1, 1
## max values : 9, 9, 9, 9, 9

Let’s reclassify the results stack, replacing the predicted factor levels (1-9) with the original land cover codes (100-1000).
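Since the predicted rasters store the factor levels 1-9 rather than the land cover codes, one way to reclassify (a sketch, assuming the level order produced by as.factor on the numeric class column) is:

```r
# Two-column "is-becomes" matrix: factor index -> class code
codes <- as.numeric(levels(training_df$class))
rcl <- cbind(seq_along(codes), codes)
results <- classify(results, rcl)
```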
print(results)
## class : SpatRaster
## dimensions : 893, 1032, 5 (nrow, ncol, nlyr)
## resolution : 30, 30 (x, y)
## extent : 274095, 305055, -4274535, -4247745 (xmin, xmax, ymin, ymax)
## coord. ref. : WGS 84 / UTM zone 19N (EPSG:32619)
## source : memory
## names : lyr1, lyr1, lyr1, lyr1, lyr1
## min values : 100, 100, 100, 100, 100
## max values : 1000, 1000, 1000, 1000, 1000

Let’s plot the different models.
We will use the extract function to prepare the verification data.frame as we did for the training set.
Now, we have to replace the IDs of the polygons from the verification shapefile with the class values of the different land cover classes.
verification_df <- terra::extract(results, verification)
classes <- data.frame(verification)
for(i in 1:nrow(verification_df)){
  verification_df$ID[i] <- classes[verification_df$ID[i], 2]
}
names(verification_df) <- c("class", "RF", "SVM_lin", "SVM_poly",
"SVM_rad", "SVM_sig")
head(verification_df)
## class RF SVM_lin SVM_poly SVM_rad SVM_sig
## 1 100 100 300 300 300 300
## 2 100 100 300 300 300 300
## 3 100 100 300 300 300 300
## 4 100 100 300 300 300 300
## 5 100 100 300 300 300 300
## 6 100 100 300 300 300 300
tail(verification_df)
## class RF SVM_lin SVM_poly SVM_rad SVM_sig
## 3095 1000 1000 1000 1000 1000 1000
## 3096 1000 1000 1000 1000 1000 1000
## 3097 1000 1000 1000 1000 1000 1000
## 3098 1000 1000 1000 1000 1000 1000
## 3099 1000 1000 1000 1000 1000 1000
## 3100 1000 1000    1000     1000    1000    1000

Finally, we can apply a confusion matrix to assess the accuracy of each method. For this purpose, we will use the confusionMatrix function from the caret package.
library(caret)
confusionMatrix(data = as.factor(verification_df$RF),
                reference = as.factor(verification_df$class))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 100 200 300 400 500 600 800 900 1000
## 100 17 0 51 0 1 0 2 0 0
## 200 0 934 0 5 0 0 0 0 0
## 300 22 0 253 2 2 0 3 0 0
## 400 0 67 14 393 5 2 0 0 0
## 500 0 0 0 0 1 2 0 0 0
## 600 0 0 0 0 0 31 0 4 9
## 800 0 0 5 0 0 0 28 1 0
## 900 0 0 0 0 0 6 3 932 3
## 1000 0 0 0 0 0 0 0 0 302
##
## Overall Statistics
##
## Accuracy : 0.9326
## 95% CI : (0.9232, 0.9412)
## No Information Rate : 0.3229
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9125
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 100 Class: 200 Class: 300 Class: 400 Class: 500
## Sensitivity 0.435897 0.9331 0.78328 0.9825 0.1111111
## Specificity 0.982359 0.9976 0.98956 0.9674 0.9993530
## Pos Pred Value 0.239437 0.9947 0.89716 0.8170 0.3333333
## Neg Pred Value 0.992737 0.9690 0.97516 0.9973 0.9974169
## Prevalence 0.012581 0.3229 0.10419 0.1290 0.0029032
## Detection Rate 0.005484 0.3013 0.08161 0.1268 0.0003226
## Detection Prevalence 0.022903 0.3029 0.09097 0.1552 0.0009677
## Balanced Accuracy 0.709128 0.9653 0.88642 0.9750 0.5552320
## Class: 600 Class: 800 Class: 900 Class: 1000
## Sensitivity 0.75610 0.777778 0.9947 0.96178
## Specificity 0.99575 0.998042 0.9945 1.00000
## Pos Pred Value 0.70455 0.823529 0.9873 1.00000
## Neg Pred Value 0.99673 0.997391 0.9977 0.99571
## Prevalence 0.01323 0.011613 0.3023 0.10129
## Detection Rate 0.01000 0.009032 0.3006 0.09742
## Detection Prevalence 0.01419 0.010968 0.3045 0.09742
## Balanced Accuracy 0.87592 0.887910 0.9946 0.98089

confusionMatrix(data = as.factor(verification_df$SVM_lin),
reference = as.factor(verification_df$class))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 100 200 300 400 500 600 800 900 1000
## 100 1 0 0 0 2 0 0 0 0
## 200 0 958 0 0 1 0 0 0 0
## 300 30 0 315 2 1 0 3 0 0
## 400 3 42 8 398 1 2 0 0 0
## 500 0 1 0 0 3 3 0 0 0
## 600 0 0 0 0 0 12 0 0 0
## 800 0 0 0 0 0 0 28 2 0
## 900 5 0 0 0 1 5 5 935 7
## 1000 0 0 0 0 0 19 0 0 307
##
## Overall Statistics
##
## Accuracy : 0.9539
## 95% CI : (0.9459, 0.961)
## No Information Rate : 0.3229
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9397
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 100 Class: 200 Class: 300 Class: 400 Class: 500
## Sensitivity 0.0256410 0.9570 0.9752 0.9950 0.3333333
## Specificity 0.9993466 0.9995 0.9870 0.9793 0.9987059
## Pos Pred Value 0.3333333 0.9990 0.8974 0.8767 0.4285714
## Neg Pred Value 0.9877301 0.9799 0.9971 0.9992 0.9980601
## Prevalence 0.0125806 0.3229 0.1042 0.1290 0.0029032
## Detection Rate 0.0003226 0.3090 0.1016 0.1284 0.0009677
## Detection Prevalence 0.0009677 0.3094 0.1132 0.1465 0.0022581
## Balanced Accuracy 0.5124938 0.9783 0.9811 0.9871 0.6660196
## Class: 600 Class: 800 Class: 900 Class: 1000
## Sensitivity 0.292683 0.777778 0.9979 0.97771
## Specificity 1.000000 0.999347 0.9894 0.99318
## Pos Pred Value 1.000000 0.933333 0.9760 0.94172
## Neg Pred Value 0.990609 0.997394 0.9991 0.99748
## Prevalence 0.013226 0.011613 0.3023 0.10129
## Detection Rate 0.003871 0.009032 0.3016 0.09903
## Detection Prevalence 0.003871 0.009677 0.3090 0.10516
## Balanced Accuracy 0.646341 0.888563 0.9936 0.98544

confusionMatrix(data = as.factor(verification_df$SVM_poly),
reference = as.factor(verification_df$class))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 100 200 300 400 500 600 800 900 1000
## 100 1 0 0 0 0 0 0 0 0
## 200 0 941 0 0 0 0 0 0 0
## 300 25 0 298 2 0 0 3 0 0
## 400 4 60 16 398 5 3 0 0 0
## 500 0 0 0 0 0 0 0 0 0
## 600 0 0 0 0 0 11 0 0 1
## 800 0 0 0 0 0 0 20 9 0
## 900 9 0 9 0 4 8 13 928 16
## 1000 0 0 0 0 0 19 0 0 297
##
## Overall Statistics
##
## Accuracy : 0.9335
## 95% CI : (0.9242, 0.9421)
## No Information Rate : 0.3229
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.913
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 100 Class: 200 Class: 300 Class: 400 Class: 500
## Sensitivity 0.0256410 0.9401 0.92260 0.9950 0.000000
## Specificity 1.0000000 1.0000 0.98920 0.9674 1.000000
## Pos Pred Value 1.0000000 1.0000 0.90854 0.8189 NaN
## Neg Pred Value 0.9877380 0.9722 0.99098 0.9992 0.997097
## Prevalence 0.0125806 0.3229 0.10419 0.1290 0.002903
## Detection Rate 0.0003226 0.3035 0.09613 0.1284 0.000000
## Detection Prevalence 0.0003226 0.3035 0.10581 0.1568 0.000000
## Balanced Accuracy 0.5128205 0.9700 0.95590 0.9812 0.500000
## Class: 600 Class: 800 Class: 900 Class: 1000
## Sensitivity 0.268293 0.555556 0.9904 0.94586
## Specificity 0.999673 0.997063 0.9727 0.99318
## Pos Pred Value 0.916667 0.689655 0.9402 0.93987
## Neg Pred Value 0.990285 0.994790 0.9957 0.99389
## Prevalence 0.013226 0.011613 0.3023 0.10129
## Detection Rate 0.003548 0.006452 0.2994 0.09581
## Detection Prevalence 0.003871 0.009355 0.3184 0.10194
## Balanced Accuracy 0.633983 0.776309 0.9816 0.96952

confusionMatrix(data = as.factor(verification_df$SVM_rad),
reference = as.factor(verification_df$class))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 100 200 300 400 500 600 800 900 1000
## 100 0 0 0 0 0 0 0 0 0
## 200 0 923 0 0 1 0 0 0 0
## 300 39 0 317 2 3 0 6 0 0
## 400 0 78 6 398 3 3 0 0 0
## 500 0 0 0 0 1 0 0 0 0
## 600 0 0 0 0 0 30 0 0 4
## 800 0 0 0 0 0 0 26 1 0
## 900 0 0 0 0 1 8 4 936 7
## 1000 0 0 0 0 0 0 0 0 303
##
## Overall Statistics
##
## Accuracy : 0.9465
## 95% CI : (0.9379, 0.9541)
## No Information Rate : 0.3229
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9303
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 100 Class: 200 Class: 300 Class: 400 Class: 500
## Sensitivity 0.00000 0.9221 0.9814 0.9950 0.1111111
## Specificity 1.00000 0.9995 0.9820 0.9667 1.0000000
## Pos Pred Value NaN 0.9989 0.8638 0.8156 1.0000000
## Neg Pred Value 0.98742 0.9642 0.9978 0.9992 0.9974185
## Prevalence 0.01258 0.3229 0.1042 0.1290 0.0029032
## Detection Rate 0.00000 0.2977 0.1023 0.1284 0.0003226
## Detection Prevalence 0.00000 0.2981 0.1184 0.1574 0.0003226
## Balanced Accuracy 0.50000 0.9608 0.9817 0.9808 0.5555556
## Class: 600 Class: 800 Class: 900 Class: 1000
## Sensitivity 0.731707 0.722222 0.9989 0.96497
## Specificity 0.998692 0.999674 0.9908 1.00000
## Pos Pred Value 0.882353 0.962963 0.9791 1.00000
## Neg Pred Value 0.996412 0.996746 0.9995 0.99607
## Prevalence 0.013226 0.011613 0.3023 0.10129
## Detection Rate 0.009677 0.008387 0.3019 0.09774
## Detection Prevalence 0.010968 0.008710 0.3084 0.09774
## Balanced Accuracy 0.865200 0.860948 0.9948 0.98248

confusionMatrix(data = as.factor(verification_df$SVM_sig),
reference = as.factor(verification_df$class))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 100 200 300 400 500 600 800 900 1000
## 100 1 0 26 0 0 0 16 110 0
## 200 0 872 0 0 5 1 0 0 0
## 300 19 0 217 9 0 0 3 22 0
## 400 0 129 38 366 1 6 1 0 0
## 500 0 0 0 0 0 0 0 0 0
## 600 0 0 0 0 0 14 0 218 29
## 800 0 0 0 0 0 0 0 0 0
## 900 19 0 42 25 3 1 16 587 1
## 1000 0 0 0 0 0 19 0 0 284
##
## Overall Statistics
##
## Accuracy : 0.7552
## 95% CI : (0.7396, 0.7702)
## No Information Rate : 0.3229
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6931
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 100 Class: 200 Class: 300 Class: 400 Class: 500
## Sensitivity 0.0256410 0.8711 0.6718 0.9150 0.000000
## Specificity 0.9503430 0.9971 0.9809 0.9352 1.000000
## Pos Pred Value 0.0065359 0.9932 0.8037 0.6765 NaN
## Neg Pred Value 0.9871055 0.9419 0.9625 0.9867 0.997097
## Prevalence 0.0125806 0.3229 0.1042 0.1290 0.002903
## Detection Rate 0.0003226 0.2813 0.0700 0.1181 0.000000
## Detection Prevalence 0.0493548 0.2832 0.0871 0.1745 0.000000
## Balanced Accuracy 0.4879920 0.9341 0.8264 0.9251 0.500000
## Class: 600 Class: 800 Class: 900 Class: 1000
## Sensitivity 0.341463 0.00000 0.6265 0.90446
## Specificity 0.919255 1.00000 0.9505 0.99318
## Pos Pred Value 0.053640 NaN 0.8458 0.93729
## Neg Pred Value 0.990490 0.98839 0.8545 0.98927
## Prevalence 0.013226 0.01161 0.3023 0.10129
## Detection Rate 0.004516 0.00000 0.1894 0.09161
## Detection Prevalence 0.084194 0.00000 0.2239 0.09774
## Balanced Accuracy 0.630359 0.50000 0.7885 0.94882

This exercise aims to predict pH over the same area where we applied the land cover classification. For this purpose, we will use the SoilGrids dataset. Specifically:
First, let’s import the 20,000 pH sample sites.
We will select 80% of the sample sites for training and keep the rest for verification. We will set a seed for reproducibility.
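A sketch of the import and split, assuming the samples come as a table with Lon, Lat, and pH columns (the column names used later with vect); the file name and seed value here are hypothetical:

```r
samples <- read.csv("pH_sample_sites.csv")  # hypothetical file name

set.seed(123)  # hypothetical seed, for reproducibility
idx <- sample(nrow(samples), 0.8 * nrow(samples))
training     <- samples[idx, ]
verification <- samples[-idx, ]
```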
Let’s import the covariates that will be used in the training and prediction steps.
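Assuming one GeoTIFF per SoilGrids property (layer names as in the extracted data frame below), the covariates could be stacked with:

```r
# Stack all SoilGrids GeoTIFFs in the working directory into one SpatRaster
covariates <- rast(list.files(pattern = "\\.tif$"))
```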
We have to extract the information of the covariates at the training sites. For this, we need a vector layer of them.
training_points <- vect(training, geom=c("Lon", "Lat"), crs = crs(covariates))
print(training_points)
## class : SpatVector
## geometry : points
## dimensions : 16000, 1 (geometries, attributes)
## extent : -71.99887, -71.00113, -38.99881, -37.00119 (xmin, xmax, ymin, ymax)
## coord. ref. : lon/lat WGS 84 (EPSG:4326)
## names : pH
## type : <num>
## values : 5.5
## 5.9
## 5.3

We will use the extract function from the terra package to generate our training data frame.
training_df <- terra::extract(covariates, training_points)
head(training_df)
## ID Bulk_density_0_5cm Cation_exchange_capacity_0_5cm Clay_content_0_5cm
## 1 1 85 356 282
## 2 2 93 318 230
## 3 3 87 332 291
## 4 4 86 335 244
## 5 5 95 296 173
## 6 6 94 340 287
## Coarse_fragments_0_5cm Nitrogen_0_5cm Organic_carbon_density_0_5cm Sand_0_5cm
## 1 108 626 710 298
## 2 205 737 713 391
## 3 109 565 732 351
## 4 133 626 655 306
## 5 96 625 662 493
## 6 87 527 740 251
## Silt_0_5cm Soil_organic_carbon_0_5cm Soil_organic_carbon_stock_0_30cm
## 1 420 1517 124
## 2 379 1713 115
## 3 358 1528 123
## 4 451 1494 100
## 5 333 1369 84
## 6 462 1390 120

Let’s replace the column of IDs of the data frame (that we don’t need) with the pH values.
training_df$ID <- training$pH
names(training_df)[1] <- "pH"
head(training_df)
## pH Bulk_density_0_5cm Cation_exchange_capacity_0_5cm Clay_content_0_5cm
## 1 5.5 85 356 282
## 2 5.9 93 318 230
## 3 5.3 87 332 291
## 4 5.5 86 335 244
## 5 6.1 95 296 173
## 6 5.5 94 340 287
## Coarse_fragments_0_5cm Nitrogen_0_5cm Organic_carbon_density_0_5cm Sand_0_5cm
## 1 108 626 710 298
## 2 205 737 713 391
## 3 109 565 732 351
## 4 133 626 655 306
## 5 96 625 662 493
## 6 87 527 740 251
## Silt_0_5cm Soil_organic_carbon_0_5cm Soil_organic_carbon_stock_0_30cm
## 1 420 1517 124
## 2 379 1713 115
## 3 358 1528 123
## 4 451 1494 100
## 5 333 1369 84
## 6 462 1390 120

Now we have everything to train the models.
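Since pH is numeric, the same calls now fit regression models. A sketch (the verification step below uses RF and the linear, polynomial, and radial SVM kernels):

```r
# Regression rather than classification: pH is numeric, not a factor
rf_model       <- randomForest(pH ~ ., data = training_df)
svm_model_lin  <- svm(pH ~ ., data = training_df, kernel = "linear")
svm_model_poly <- svm(pH ~ ., data = training_df, kernel = "polynomial")
svm_model_rad  <- svm(pH ~ ., data = training_df, kernel = "radial")
```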
We can use the varImpPlot function to assess the importance of the covariates.
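For the RF model this is a one-liner:

```r
# Dot chart of variable importance for the fitted random forest
varImpPlot(rf_model)
```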
Now we can spatially predict pH over the study area.
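The prediction step mirrors the classification exercise (object names as reused in the stacking step below):

```r
rf_LC       <- predict(covariates, rf_model)
svm_LC_lin  <- predict(covariates, svm_model_lin)
svm_LC_poly <- predict(covariates, svm_model_poly)
svm_LC_rad  <- predict(covariates, svm_model_rad)
```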
Now, we can create a stack of the predicted models to later use them for verification purposes.
results <- rf_LC
add(results) <- c(svm_LC_lin, svm_LC_poly, svm_LC_rad)
names(results) <- c("RF", "SVM_lin", "SVM_poly", "SVM_rad")
print(results)
## class : SpatRaster
## dimensions : 837, 442, 4 (nrow, ncol, nlyr)
## resolution : 0.002262443, 0.002389486 (x, y)
## extent : -72, -71, -39, -37 (xmin, xmax, ymin, ymax)
## coord. ref. : lon/lat WGS 84 (EPSG:4326)
## sources : memory
## memory
## memory
## ... and 1 more source(s)
## names : RF, SVM_lin, SVM_poly, SVM_rad
## min values : 2.564704e-14, 9.081915e-02, -9.031603e-02, 9.088693e-02
## max values : 6.437633, 6.373022, 7.265613, 6.438758

We have to generate a spatial object with the verification sites as well.
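As with the training sites, vect can build the points from the Lon/Lat columns (a sketch):

```r
verification_points <- vect(verification, geom = c("Lon", "Lat"),
                            crs = crs(covariates))
```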
Now, we can extract the pH prediction over the verification sites.
verification_df <- terra::extract(results, verification_points)
head(verification_df)
## ID RF SVM_lin SVM_poly SVM_rad
## 1 1 5.964537 6.158771 5.950130 5.918034
## 2 2 6.101903 6.165129 6.089038 6.046751
## 3 3 5.476773 5.474062 5.329843 5.415686
## 4 4 5.978840 5.965380 5.855315 5.953640
## 5 5 5.911273 5.771236 5.868699 5.866439
## 6 6 5.873360 5.862882 5.888656 5.881538

Similarly to what we did for the training data, we have to replace the ID values with the pH ones.
verification_df$ID <- verification$pH
names(verification_df)[1] <- "pH"
head(verification_df)
## pH RF SVM_lin SVM_poly SVM_rad
## 1 6.0 5.964537 6.158771 5.950130 5.918034
## 2 6.0 6.101903 6.165129 6.089038 6.046751
## 3 5.3 5.476773 5.474062 5.329843 5.415686
## 4 6.0 5.978840 5.965380 5.855315 5.953640
## 5 5.9 5.911273 5.771236 5.868699 5.866439
## 6 5.8 5.873360 5.862882 5.888656 5.881538

Finally, we can assess the performance of the models. For this purpose, we will use the rmse function from the hydroGOF package.
## Evaluation RMSE
rmse(obs = verification_df$pH, sim = verification_df$RF)
## [1] 0.09447623
rmse(obs = verification_df$pH, sim = verification_df$SVM_lin)
## [1] 0.1340951
rmse(obs = verification_df$pH, sim = verification_df$SVM_poly)
## [1] 0.1413756
rmse(obs = verification_df$pH, sim = verification_df$SVM_rad)
## [1] 0.1037692